# Low-power and Fast Monolithic 8-2 Adder Compressor Circuit with Efficient Carry Propagation

Thomas Fontanari<sup>\*</sup>, Gustavo M. Santana<sup>\*</sup>, Guilherme Paim<sup>\*</sup>, Leandro M. G. Rocha<sup>\*</sup>, Eduardo A. C. da Costa<sup>†</sup>, Sergio Bampi<sup>\*</sup>

\*Graduate Program on Microelectronics (PGMicro) - Federal University of Rio Grande do Sul (UFRGS) Porto Alegre - Brazil

<sup>†</sup>Graduate Program on Electronic Engineering and Computing - Catholic University of Pelotas (UCPel) Pelotas - Brazil

*Abstract*—Adder compressor architectures have been widely used in multipliers and have recently achieved improvements over conventional approaches in the computation of transform blocks in the context of video coding. This paper reviews four different 8-2 adder compressor architectures and proposes a new one with more efficient carry propagation. The compressors are synthesized to the ST 65 nm commercial standard cell library using the Cadence Genus synthesis tool. The results show that the proposed 8-2 adder compressor architecture achieves improvements in both clock timing and power dissipation, demonstrating the importance of considering how the carry propagates when designing adder compressor circuits.

Keywords—adder compressor architectures; 8-2 adder compressor;

## I. INTRODUCTION

The adder compressor is a circuit that sums more than two operands simultaneously by compressing them into only two operands. Hence, an N-2 compressor compresses N operands into two, such that their sum is equal to the sum of the N operands. A final recombination stage is thus necessary to perform the actual sum of the two remaining operands. Different compressor topologies have been designed and implemented, and they show improvements in both power dissipation and operating frequency when compared to conventional adders. For instance, the 3-2, 4-2, 5-2 [1] and 7-2 [2] compressors are well-studied architectures that have proven to be fast and low-power.

Adder compressor architectures have also been extensively used in multipliers and have recently been used in the computation of dedicated transform blocks [3], achieving great improvements over standard approaches. More specifically, the 8-2 compressor has been used in the sum of absolute differences (SAD) architecture in video encoders [4], also attaining improvements over conventional adders, especially regarding power dissipation.

Using compressors for adding multiple operands, however, involves propagating carries across multiple modules. As will be further explored in Section II, this propagation can extend through all compressors if the internal signal routing is carelessly implemented. This can cause the critical path of a compressor to grow linearly with the size of the operands, even though a limited carry propagation can be achieved.

This paper explores four existing state-of-the-art architectures for 8-2 compressors and proposes a new one, which is essentially a reorganization of the carry propagation in one of the cited architectures. This reorganization, however, limits the carry propagation and enables the architecture to achieve better timing and lower power dissipation. Hence, five different structures for the 8-2 compressor are compared in terms of operating frequency, power dissipation and circuit area. The circuits are synthesized with the Cadence Genus Synthesis Solution [5] tool using the ST 65 nm [6] commercial standard cell library for operand widths of 9, 16, 24, 32, 64 and 128 bits.

## II. ADDER COMPRESSORS OVERVIEW

This section begins by exploring the 3-2, 4-2, 5-2 and 7-2 adder compressors. Then, four different hierarchical versions of the 8-2 compressor are presented and compared with respect to their critical paths.

### A. Adder Compressors

Architectures for 1-bit versions of the 3-2, 4-2, 5-2 and 7-2 compressors are shown in Figure 1. These circuits are used to compress the input operands into only two outputs, *Carry* and *Sum*, which can then be recombined in a way such that the result is equal to adding all input operands. The actual sum of the inputs can be obtained through the equation $(Sum + 2 \times Carry)$. The *Cout* signals are in turn propagated to the next 1-bit compressor, where they are used as *Cin* signals, to form N-bit compressors.

If we wanted, for instance, to add four operands, with 16 bits each, we would need 16 1-bit 4-2 compressors, i.e., one for each bit. The *Cout* from compressor *i* would then be the *Cin* for compressor *i*+1. Each compressor would also output one *Sum* bit and one *Carry* bit, resulting in two operands *Sum* and *Carry*, each with a width of 16 bits. The addition  $(Sum + 2 \times Carry + 2^{16} \times Cout_{15})$  in the recombination stage would result in the sum of the four original operands.
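The 16-bit example above can be checked with a small bit-level model. The sketch below is a minimal illustration, assuming a common XOR/MUX realization of the 1-bit 4-2 compressor; it does not claim to reproduce the exact gate arrangement of Figure 1. The function names `compressor_4_2`, `compress_n_bit` and `recombine` are hypothetical.

```python
def mux(sel, a, b):
    """2:1 multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def compressor_4_2(x1, x2, x3, x4, cin):
    """1-bit 4-2 compressor from XOR and MUX gates (assumed structure)."""
    a = x1 ^ x2
    b = x3 ^ x4
    c = a ^ b
    s = c ^ cin                # Sum output, weight 1
    cout = mux(a, x1, x3)      # majority(x1, x2, x3): independent of cin
    carry = mux(c, x4, cin)    # Carry output, weight 2
    return s, carry, cout


def compress_n_bit(operands, width):
    """Compress four width-bit operands into Sum and Carry words plus the last Cout."""
    sum_word = 0
    carry_word = 0
    cin = 0  # the first compressor has no incoming carry
    for i in range(width):
        bits = [(op >> i) & 1 for op in operands]
        s, carry, cout = compressor_4_2(*bits, cin)
        sum_word |= s << i
        carry_word |= carry << i
        cin = cout  # Cout of compressor i feeds Cin of compressor i+1
    return sum_word, carry_word, cin


def recombine(sum_word, carry_word, cout_last, width):
    """Recombination stage: Sum + 2*Carry + 2^width * Cout of the last compressor."""
    return sum_word + 2 * carry_word + (cout_last << width)


operands = [0xFFFF, 0x1234, 0x00FF, 0x8001]
sum_word, carry_word, cout_last = compress_n_bit(operands, 16)
assert recombine(sum_word, carry_word, cout_last, 16) == sum(operands)
```

Note that in this assumed structure `cout` is computed from the operand bits only, so the carry handed to the next compressor never waits on the incoming `cin`.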

To estimate the efficiency of these circuits regarding timing, we need to consider their critical path. Taking as an example the 1-bit 5-2 compressor, we might at first assume that its critical path is given by four *XOR* gates. Though this is true for the 1-bit version in isolation, it is not the case when an N-bit version of the 5-2 compressor is considered. The reason is that the $Cin_1$ of compressor *i* depends on the $Cout_1$ of compressor *i*-1. In this case, the critical path increases to five logic gates: four *XOR* gates plus one *MUX*.

Fig. 1: 1-bit versions of the 3-2, 4-2, 5-2 and 7-2 compressors.

### B. 8-2 Hierarchical Adder Compressors

Four different state-of-the-art architectures for the 8-2 compressor, designed by hierarchically combining smaller compressors, are shown in Figure 2. These four architectures were explored in the context of video coding in [4], [7] and [8].

If we considered only the 1-bit version of each 8-2 compressor, we would conclude that their critical paths are given by six, six, five and eight logic gates for architectures (a), (b), (c) and (d), respectively. This is not the case, however, when operands with more than one bit are considered. In fact, the critical paths of architectures (a) and (c) depend on the size of the operands. This can be seen by observing that, in (c), the value of $Cout_3$ depends on the value of $Cin_3$, which in turn is the $Cout_3$ of the previous 8-2 compressor, and so forth. This results in a critical path that depends on the number of 1-bit 8-2 compressors being used, i.e., on the size of the input operands. Similarly, in architecture (a), the value of $Cout_1$ depends on the value of $Cin_1$, culminating in a $Cout_1$ propagation across all compressors. This condition does not arise in circuits (b) and (d), whose carry propagation is limited. Their critical paths are nevertheless also longer than the 1-bit estimates, reaching ten logic gates for both of them, as shown in the synthesis results.

Nonetheless, we note here that, for smaller operand widths, such as 9 bits, architectures (a) and (c) might still achieve better results for maximum operating frequency and power dissipation, since the problem arising from the carry propagation will not have a noticeable impact on performance.
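The effect of carry routing on the critical path can be illustrated with a toy gate-depth model. The sketch below does not model the 8-2 circuits of Figure 2. As a simplifying assumption, it uses a 4-2 cell built from two full adders, routed in two different ways, and counts each XOR or MUX as one gate level. All names are hypothetical.

```python
from itertools import product

# A signal is a (value, depth) pair; depth counts gate levels from the inputs.


def xor(a, b):
    """XOR gate: output depth is one level above its deepest input."""
    return (a[0] ^ b[0], max(a[1], b[1]) + 1)


def mux(sel, a, b):
    """2:1 MUX: selects a when sel is 0, b when sel is 1; one gate level."""
    value = b[0] if sel[0] else a[0]
    return (value, max(sel[1], a[1], b[1]) + 1)


def full_adder(x, y, z):
    """Returns (sum, carry-out); carry-out is majority(x, y, z) via a MUX."""
    p = xor(x, y)
    return xor(p, z), mux(p, x, z)


def cell_limited(x1, x2, x3, x4, cin):
    """cin enters only the second full adder, so cout never depends on cin."""
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def cell_rippling(x1, x2, x3, x4, cin):
    """cin enters the first full adder, so cout depends on cin and ripples."""
    s1, cout = full_adder(x1, x2, cin)
    s, carry = full_adder(s1, x3, x4)
    return s, carry, cout


def is_valid_compressor(cell):
    """Check x1+x2+x3+x4+cin == sum + 2*(carry + cout) for all 32 input patterns."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = cell(*[(b, 0) for b in bits])
        if sum(bits) != s[0] + 2 * (carry[0] + cout[0]):
            return False
    return True


def worst_depth(cell, width):
    """Deepest output (in gate levels) of a width-bit chain of cells."""
    cin = (0, 0)
    worst = 0
    for _ in range(width):
        s, carry, cout = cell((0, 0), (0, 0), (0, 0), (0, 0), cin)
        worst = max(worst, s[1], carry[1], cout[1])
        cin = cout  # cout of this cell is cin of the next one
    return worst


for width in (9, 16, 32, 128):
    print(width, worst_depth(cell_limited, width), worst_depth(cell_rippling, width))
```

Both cells compute the same arithmetic, yet in this model the limited routing stays at four gate levels for every width, while the rippling routing adds one level per bit.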

## III. PROPOSED MONOLITHIC 8-2 COMPRESSOR

The proposed 8-2 compressor is shown in Figure 3. We refer to this circuit here as monolithic since it was conceived using only *XOR* and *MUX* gates, instead of hierarchically combining smaller compressors. The end result is a topology that is identical to the one resulting from the hierarchical compressor (c). The difference lies in the way the *Cout* signals are routed between the compressors in an N-bit configuration.

Similarly to the hierarchical compressor (c), the 1-bit monolithic 8-2 compressor has only five logic gates in its critical path. When we analyse the N-bit version, we note that this circuit does not suffer from the same carry propagation problem as the hierarchical one. The value of $Cout_3$ here depends only on the value of the input operands and on the value of $Cin_0$. The $Cin_0$ signal, in turn, is the $Cout_0$ value from the previous compressor, which depends only on the previous input operand bits. This results in a limited carry propagation in the compressor.

## IV. SYNTHESIS RESULTS AND DISCUSSION

Five different architectures for the 8-2 compressor were synthesized: the four hierarchical versions referenced in Figure 2 and our proposed version. All architectures were synthesized to the ST 65 nm technology library using the Cadence Genus synthesis tool for bit-widths of 9, 16, 24, 32, 64 and 128, targeting an operating frequency of 50 MHz, in order to obtain the estimated power dissipation, circuit area and critical path. The maximum clock frequency was obtained by increasing the target operating frequency used in the synthesis up to the highest value at which the circuits could still operate reliably without timing errors.
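The frequency sweep described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the step size, the upper limit and the linear search strategy are not given in the text, and `meets_timing` is a hypothetical callable standing in for one synthesis run at a given target frequency (no synthesis tool interface is modelled here).

```python
def find_max_frequency(meets_timing, start_mhz=50.0, step_mhz=1.0, limit_mhz=5000.0):
    """Return the highest swept target frequency (MHz) that still meets timing.

    meets_timing: hypothetical callable, frequency in MHz -> bool.
    Returns None if even the starting frequency fails.
    """
    best = None
    freq = start_mhz
    while freq <= limit_mhz and meets_timing(freq):
        best = freq
        freq += step_mhz
    return best


# Stand-in for a synthesis run: a made-up circuit whose critical path takes
# 2.0 ns meets timing whenever the clock period (1000 / f ns) is at least 2.0 ns.
HYPOTHETICAL_DELAY_NS = 2.0
max_mhz = find_max_frequency(lambda f: 1000.0 / f >= HYPOTHETICAL_DELAY_NS)
print(max_mhz)
```

In practice a coarser step or a binary search would reduce the number of synthesis runs; the linear sweep is kept here because it mirrors the description most directly.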

### A. Critical Path and Clock Frequency Analysis

The number of logic gates in the critical path of each architecture is shown in Table I. We note how the path in the hierarchical architectures (a) and (c) grows as the number of bits increases, as a result of the carry propagation, whereas the other structures do not show this behaviour. Therefore, the architectures that do not have a limited carry propagation chain should incur a timing penalty.

Figure 4 shows how the clock frequency is affected by the carry propagation problem. We see that as the number of bits increases, the maximum operating frequency of the architectures with carry propagation rapidly diminishes, while the frequencies of the architectures without propagation issues remain approximately constant.

The 8-2 compressor proposed here performed very well in comparison with the hierarchical version (c), from which it differs only in the manner in which the carries are connected between the compressor instances. As a consequence of the shorter critical path resulting from the limited carry propagation, the monolithic version attained clock frequency improvements over the hierarchical version (c) of about 29%, 74%, 132%, 191%, 400% and 830% for 9, 16, 24, 32, 64 and 128 bits, respectively.

Fig. 2: Four hierarchical versions of the 8-2 compressor.

Fig. 3: Proposed monolithic 8-2 adder compressor.

Moreover, because our proposed 8-2 compressor has only eight logic gates in its critical path, while both hierarchical versions (b) and (d) have ten, it also achieved better timing than these two architectures. The improvement is 12% over (d) and 23% over (b) for all bit-widths: while architectures (d) and (b) can operate at 531 MHz and 484 MHz, respectively, the proposed architecture reaches 595 MHz.
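The percentages quoted for (d) and (b) follow from the reported frequencies if the improvement is taken as the relative gain in maximum clock frequency, which is an assumption about how the figures were computed. The helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(new_mhz, baseline_mhz):
    """Relative frequency gain of new over baseline, in percent."""
    return (new_mhz / baseline_mhz - 1.0) * 100.0


PROPOSED_MHZ = 595.0
BASELINES_MHZ = {"hierarchical (d)": 531.0, "hierarchical (b)": 484.0}

# Gain of the proposed compressor over each baseline, using the reported values.
gains = {
    name: improvement_percent(PROPOSED_MHZ, mhz)
    for name, mhz in BASELINES_MHZ.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% higher maximum frequency")
```

Rounded to the nearest integer, these gains match the 12% and 23% stated in the text.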

TABLE I: Logic Gates in Critical Path

| Compressor architecture | 9 bits | 16 bits | 24 bits | 32 bits | 64 bits | 128 bits |
|------------------|----|----|----|----|----|-----|
| Monolithic       | 8                  | 8  | 8  | 8  | 8  | 8   |
| Hierarchical (a) | 15                 | 22 | 30 | 38 | 91 | 231 |
| Hierarchical (b) | 10                 | 10 | 10 | 10 | 10 | 10  |
| Hierarchical (c) | 14                 | 21 | 29 | 37 | 90 | 221 |
| Hierarchical (d) | 10                 | 10 | 10 | 10 | 10 | 10  |
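The growth visible in Table I can be summarized as the number of critical-path gates added per additional operand bit between consecutive bit-widths. The sketch below only post-processes the table values; the function name `gates_per_added_bit` is hypothetical.

```python
WIDTHS = [9, 16, 24, 32, 64, 128]

# Values copied from Table I (logic gates in the critical path).
CRITICAL_PATH_GATES = {
    "Monolithic":       [8, 8, 8, 8, 8, 8],
    "Hierarchical (a)": [15, 22, 30, 38, 91, 231],
    "Hierarchical (b)": [10, 10, 10, 10, 10, 10],
    "Hierarchical (c)": [14, 21, 29, 37, 90, 221],
    "Hierarchical (d)": [10, 10, 10, 10, 10, 10],
}


def gates_per_added_bit(widths, gates):
    """Slope of the critical path between consecutive bit-widths."""
    points = list(zip(widths, gates))
    return [
        (g1 - g0) / (w1 - w0)
        for (w0, g0), (w1, g1) in zip(points, points[1:])
    ]


for name, gates in CRITICAL_PATH_GATES.items():
    slopes = gates_per_added_bit(WIDTHS, gates)
    print(name, [round(s, 2) for s in slopes])
```

For the monolithic, (b) and (d) architectures every slope is zero, while for (a) and (c) each additional bit adds at least one gate to the critical path.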



Fig. 4: Maximum operating frequency for different 8-2 compressor architectures.

### B. Power Dissipation Analysis

Figure 5 shows the total power dissipation results for each architecture as a function of the input operand bit-width. From the plot we see that for small bit-widths the power dissipation is approximately equal for all compressors, but the differences become noticeable as the operand bit-width increases.

Despite having achieved better timing results, the monolithic version also dissipated less power, especially for larger operand bit-widths. In comparison with the hierarchical 8-2 compressor (d), which had a maximum operating clock frequency close to that of the monolithic version, the latter dissipated 8.6% less power for 128 bits, 7.9% less for 64 bits, and 6.6% less for 32 bits. However, we note that since the synthesis was not specific to any application, results might change depending on the context in which the circuits are synthesized. The power dissipation results were estimated by the synthesis tool considering a 50% switching activity for the whole circuit.



Fig. 5: Total power dissipation for different 8-2 compressor architectures.

### C. Circuit Area Analysis

Figure 6 shows the total circuit area used for each of the five architectures, with respect to the operands width. Similarly to what occurs with power dissipation, all circuits use virtually the same area for small operand widths and the differences become more apparent as the width increases.

We note that for 64 and 128 bits, the areas of the hierarchical architectures (a) and (c) grow faster than those of the other architectures. This is a result of the synthesis tool attempting to keep the circuits functional at the desired target operating frequency of 50 MHz, which leads to the insertion of buffers and of logic gates with higher drive strengths.



Fig. 6: Circuit area for different 8-2 compressor architectures.

## V. CONCLUSION

This paper presented a comparison between four well-researched state-of-the-art 8-2 compressors and our proposed 8-2 compressor. Each circuit was synthesized in 65 nm standard cell technology for operands of 9, 16, 24, 32, 64 and 128 bits. The results show the importance of correctly defining how each *Cout* signal is routed inside the compressor, such that the carry propagation is limited.

Moreover, our proposed circuit also achieved better results for power dissipation. Though this result was obtained without real-world application inputs, the outcomes suggest it might be a good option in some contexts, especially when larger operands are used, as is the case with multipliers.

## ACKNOWLEDGMENT

The authors would like to thank CNPq, Capes and Fapergs Brazilian agencies for financial support to our research.

## REFERENCES

- [1] C.-H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 51, no. 10, pp. 1985–1997, 2004.
- [2] M. Rouholamini, O. Kavehie, A. P. Mirbaha, S. J. Jasbi, and K. Navi, "A New Design for 7:2 Compressors," in 2007 IEEE/ACS International Conference on Computer Systems and Applications, May 2007, pp. 474–478.
- [3] G. M. Santana, G. Paim, L. M. G. Rocha, R. Neuenfeld, M. B. Fonseca, E. A. C. da Costa, and S. Bampi, "Using efficient adder compressors with a split-radix butterfly hardware architecture for low-power IoT smart sensors," in 2017 24th IEEE International Conference on Electronics, Circuits and Systems (ICECS), Dec 2017, pp. 486–489.
- [4] B. Silveira, G. Paim, B. Abreu, M. Grellert, C. M. Diniz, E. A. C. da Costa, and S. Bampi, "Power-Efficient Sum of Absolute Differences Hardware Architecture Using Adder Compressors for Integer Motion Estimation Design," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. PP, no. 99, pp. 1–12, 2017.
- [5] "Cadence EDA tools," http://www.cadence.com.
- [6] "ST 65nm Standard Cell Library," www.st.com.
- [7] J. S. Altermann, E. A. C. da Costa, and S. Bampi, "Fast forward and inverse transforms for the H.264/AVC standard using hierarchical adder compressors," in 2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip, Sept 2010, pp. 310–315.
- [8] C. M. Diniz, M. B. Fonseca, E. Costa, and S. Bampi, "Enhancing a HEVC interpolation filter hardware architecture with efficient adder compressors," in 2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS), June 2015, pp. 1–4.